Representing discourse for automatic text summarization via shallow NLP techniques

Authors

  • Laura Alonso
  • Silvia Ventosa
Abstract

In this thesis I have addressed the problem of text summarization from a linguistic perspective. After reviewing some work in the area, I have found that many satisfactory approaches to text summarization rely on general properties of language that are reflected in the surface realization of texts. The main claim of this thesis is that some general properties of the discursive organization of texts can be identified at surface level, that they provide objective evidence to support theories about the organization of texts, and that they can help improve existing approaches to text summarization. To support this claim, I have determined which shallow cues are indicative of discourse organization and, of these, which fall within the scope of the natural language processing techniques currently available for Catalan and Spanish: punctuation, some syntactic structures and, most of all, discourse markers. I have developed a framework to obtain a representation of the discursive organization of texts at inter- and intra-sentential scope. I have described the nature of minimal discourse units (discourse segments and discourse markers) and the relations between them, relating theoretical explanations to empirical descriptions and systematizing discursive effects as signalled by shallow cues. Based on the evidence provided by these cues, I have induced an inventory of basic meanings to describe the discursive function of the relations between units. The inventory is organized in two dimensions of discursive meaning: structural (continuation and elaboration) and semantic (revision, cause, equality and context). In turn, each dimension is organized in a range of markedness, with a default meaning that is assigned to unmarked cases.
I have shown that the proposed representation helps improve the quality of automatic summaries in two different approaches: it was integrated within a lexical chain summarizer, to obtain a representation of text that combines aspects of cohesion and coherence, and it was also one of the components of the text analysis in an e-mail summarizer. Finally, some experiments with human judges indicate that this representation of texts is useful to explain how some discursive features influence the perception of relevance and, more concretely, to characterize those fragments of texts that judges tend to consider irrelevant. The results of experiments with automatic procedures for the analysis of texts correlate well with the perception of relevance observed in human judges.

Similar resources

Cunha towards discourse parsing in Spanish

Texts can be analysed from different perspectives. One of the most difficult phenomena to process is discourse structure (Hovy 2010). In recent years, one of the main challenges in the field of natural language processing (NLP) has been discourse parsing. Research on this topic has been done for several languages, such as Japanese (Sumita et al. 1992), English (Marcu 2000) and Portuguese (Pardo...

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

Systematic literature review of fuzzy logic based text summarization

'Information Overload' is not a new term but with the massive development in technology which enables anytime, anywhere, easy and unlimited access; participation & publishing of information has consequently escalated its impact. Assisting users' informational searches with reduced reading surfing time by extracting and evaluating accurate, authentic & relevant information are the primary c...

A Scalable Summarization System Using Robust NLP

We describe a scalable summarization system which takes advantage of robust NLP technology such as corpus-based statistical NLP techniques, information extraction and readily available on-line resources. The system attempts to compensate for the bottlenecks of traditional frequency-based, knowledge-based or discourse-based summarization approaches by utilizing features derived by these robust technique...

Recent developments in temporal information extraction

The growing interest in practical NLP applications such as text summarization and question-answering places increasing demands on the processing of temporal information in natural languages. To support this, several new capabilities have emerged. These include the ability to tag events and time expressions, to temporally anchor and order events, and to build models of the temporal structure of ...

Publication date: 2005